PHD-302 Introduction to Open Data Science 2023!
Check out my GitHub repository here: https://github.com/annatolo/IODS-project
date()
## [1] "Mon Dec 4 13:27:27 2023"
So far, I have found the R for Health and Data Science book and Exercise Set 1 to be excellent resources for familiarising myself with R. Since I am a total beginner, Chapter 2 (R basics) was my favourite, as comprehending Chapters 3-5 still requires a bit more studying on my part… Everything to do with plots seems complicated! However, I think I like working with R Markdown in general.
This week, after completing the data-wrangling exercise, I embarked on a statistical exploration of the students2014 dataset: importing the data, examining its structure, and visualizing student ages and exam scores.
I used histograms to analyze the age distribution and score variability, noting skewness and outliers that provide insights into the student demographic and academic achievement. Boxplots offered a gender-based comparison of exam points, revealing median performances and exceptional cases.
Then, as per the instructions, I constructed a linear regression model to investigate the influence of students’ learning approaches on exam scores. Despite the model’s modest explanatory power, as indicated by an R-squared value of 4.07%, it did show some interesting points about the significance of the learning methods—deep, strategic, and surface—on academic outcomes.
Finally, I utilized diagnostic plots to validate the regression model’s assumptions, assessing linearity, normality, and the impact of influential data points. These visual tools illustrated the robustness of the model and any potential weaknesses in its fit to the data.
Throughout this process, I enhanced my understanding of data visualization techniques, the interpretation of statistical models, and the critical evaluation of model assumptions. This endeavor sharpened my analytical skills, particularly in applying statistical concepts to real-world educational data using R.
date()
## [1] "Mon Dec 4 13:27:27 2023"
I’m starting out by reading the students2014 data from my local folder.
students2014 <- read.csv("/Users/annatolonen/Desktop/IODS-project/data/learning2014.csv", sep = ",")
Next, I’m moving onto displaying the structure and dimensions of the dataset.
str(students2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : chr "F" "M" "F" "M" ...
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ Attitude: num 3.9 3.3 3.3 3.7 3.5 3.8 3.7 3.7 3.4 2.9 ...
## $ Deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ Stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ Surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
dim(students2014)
## [1] 166 7
I'm starting the graphical overview of the data by creating histograms for the numerical variables. First, the ages of the students…
hist(students2014$Age)
Description: The histogram illustrates the distribution of ages among the students who participated in the survey. Each bar corresponds to an age interval, showing the number of students that fall into that age bracket.
Interpretation:
Next, the points obtained…
hist(students2014$Points)
Description: This histogram shows the distribution of exam points scored by students. Each bar represents the number of students that achieved a score within a specific range.
Interpretation:
Central Tendency: The most common score range is centered around 20 to 25 points, indicating that this is where the majority of students’ scores lie.
Spread: The distribution of scores spans from approximately 5 points to over 30, showing a wide range in performance among students.
Skewness: The distribution appears to be left-skewed, with a tail extending towards the lower end of points. This suggests that while most students scored around the middle range, a smaller number of students scored significantly lower.
Outliers: Any bars isolated from the others towards the higher end of the scale could be considered positive outliers, representing students who scored much higher than their peers. In the educational context, such outliers might indicate students who have a particularly strong grasp of the material, possibly due to prior knowledge, natural aptitude, or more effective study strategies. Conversely, isolated bars at the lower end of the scale represent negative outliers: students who scored much lower than the majority, which could suggest struggles with the course content or external factors that negatively impacted their performance.
Implications of Outliers: The presence of outliers, especially if there are many or they are extreme, can have implications for teaching and learning. For example, it might prompt an instructor to consider whether the course materials are accessible to all students or whether additional support could be offered. It might also reflect the need for course content adjustments or highlight the presence of particularly challenging topics that could be addressed differently in the future.
Statistical Considerations: From a statistical perspective, outliers can affect the mean of the data, potentially skewing it away from the median. They can also impact the assumptions of certain statistical tests and models. For example, in regression analysis, outliers can disproportionately influence the slope of the regression line and the overall fit of the model, leading to misleading interpretations.
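To make the last point concrete, here is a minimal sketch (with made-up scores, not the students2014 data) of how a single extreme value shifts the mean while the median barely moves:

```r
# Hypothetical exam scores (illustrative only, not from students2014)
scores     <- c(20, 21, 22, 23, 24)
scores_out <- c(scores, 90)   # add one extreme positive outlier

mean(scores)        # 22
median(scores)      # 22
mean(scores_out)    # ~33.3 -- pulled sharply upward by the outlier
median(scores_out)  # 22.5  -- barely moves
```

This is why the median is often reported alongside (or instead of) the mean when a distribution contains outliers.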
To compare the distribution of exam points between female (F) and male (M) students, I'm creating a boxplot.
boxplot(Points ~ gender, data = students2014)
Description: The boxplot displays the distribution of exam points for students, segregated by gender. The central box of each plot represents the interquartile range (IQR) where the middle 50% of scores lie. The line within each box indicates the median score. The “whiskers” extend to the furthest points that are not considered outliers, and any points beyond the whiskers are plotted individually as potential outliers.
Interpretation:
Generally, if the goal is to ensure equitable outcomes across genders, the similarity in median and IQR might be encouraging, but the presence of outliers in the female group might warrant a closer look to understand their causes. It’s also important to note that boxplots do not show the distribution’s modality or skewness; hence, the presence of outliers does not necessarily imply a skewed distribution.
Now, to show the relationship between students’ ages and their exam points, I’m creating a scatter plot.
plot(students2014$Age, students2014$Points)
Description: The scatter plot visualizes each student’s exam points against their age. Each dot represents a student, with the horizontal position indicating their age and the vertical position indicating the number of points they scored on the exam.
Interpretation:
Interestingly, when considering the potential impact of maturity and life experience on academic performance, as suggested by the age variable in the scatter plot, the lack of a clear trend may indicate that these factors do not have a straightforward or linear relationship with exam scores in the context of this course.
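One way to quantify the "no clear trend" reading is the Pearson correlation between the two variables; a hedged sketch on simulated data (the real value would come from cor(students2014$Age, students2014$Points)):

```r
# Simulated ages and scores with no built-in relationship (illustrative only)
set.seed(1)
age    <- sample(17:55, 100, replace = TRUE)
points <- round(runif(100, min = 7, max = 33))

# With independent variables the correlation is typically close to 0
cor(age, points)
```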
summary(students2014)
## gender Age Attitude Deep
## Length:166 Min. :17.00 Min. :1.300 Min. :1.583
## Class :character 1st Qu.:21.00 1st Qu.:3.025 1st Qu.:3.333
## Mode :character Median :22.00 Median :3.400 Median :3.667
## Mean :25.51 Mean :3.381 Mean :3.680
## 3rd Qu.:27.00 3rd Qu.:3.775 3rd Qu.:4.083
## Max. :55.00 Max. :5.000 Max. :4.917
## Stra Surf Points
## Min. :1.250 Min. :1.583 Min. : 7.00
## 1st Qu.:2.625 1st Qu.:2.417 1st Qu.:19.00
## Median :3.188 Median :2.833 Median :23.00
## Mean :3.121 Mean :2.787 Mean :22.72
## 3rd Qu.:3.625 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :5.000 Max. :4.333 Max. :33.00
The students2014 dataset reflects a student group with an age range from 17 to 55, indicating diversity in student demographics, though the average age is about 25, suggesting a predominantly young adult cohort. Attitudes towards statistics vary but generally lean positive, with a median score of 3.4 out of 5. Learning approaches show a tendency towards deeper, more strategic engagement, with less emphasis on surface-level learning, as indicated by the higher median scores for Deep and Stra and a lower median for Surf. Exam points are distributed across a wide range, from 7 to 33, with a median of 23, suggesting varied academic performance among the students.
Now, I’m choosing three variables as explanatory variables and fitting a regression model where exam points is the target (dependent, outcome) variable. Given that ‘Deep’, ‘Stra’, and ‘Surf’ represent different learning approaches and could potentially influence exam performance, they seem like reasonable choices for explanatory variables.
Fitting the linear regression model and showing its summary:
model <- lm(Points ~ Deep + Stra + Surf, data = students2014)
summary(model)
##
## Call:
## lm(formula = Points ~ Deep + Stra + Surf, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.1235 -3.0737 0.5226 4.2799 10.3229
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.9143 5.1169 5.260 4.5e-07 ***
## Deep -0.7443 0.8662 -0.859 0.3915
## Stra 0.9878 0.5962 1.657 0.0994 .
## Surf -1.6296 0.9153 -1.780 0.0769 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.827 on 162 degrees of freedom
## Multiple R-squared: 0.04071, Adjusted R-squared: 0.02295
## F-statistic: 2.292 on 3 and 162 DF, p-value: 0.08016
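As a sanity check on the "Multiple R-squared" line above, the statistic can be recomputed by hand as 1 - RSS/TSS; a toy illustration on simulated data (not the course dataset):

```r
# Simulated data for illustration
set.seed(42)
x   <- rnorm(50)
y   <- 0.3 * x + rnorm(50)
fit <- lm(y ~ x)

# R-squared = 1 - residual sum of squares / total sum of squares
r2 <- 1 - sum(resid(fit)^2) / sum((y - mean(y))^2)
all.equal(r2, summary(fit)$r.squared)  # TRUE
```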
To interpret the results, I’m applying the following principles:
Generating diagnostic plots for the linear regression model and setting up the plotting area to display 4 plots (2x2 layout)
par(mfrow = c(2, 2))
plot(model)
Explanations for the plots above:
Interpreting the plots:
date()
## [1] "Mon Dec 4 13:27:27 2023"
I’m starting out by reading the students2014 data from my local folder (and loading the necessary libraries).
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
joined_data <- read.csv("/Users/annatolonen/Desktop/IODS-project/data/joined_and_modified_data.csv")
Next, I’m printing the names of the variables.
print(names(joined_data))
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "failures" "paid" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
Description of the dataset: The dataset used in this chapter is a compilation of information from two separate studies on student performance from secondary schools in Portugal, which has been merged to provide a comprehensive look at various factors that might impact student success. It includes demographic details, academic achievements, and specific lifestyle choices, with a particular focus on alcohol consumption patterns. The inclusion of variables such as average alcohol use (alc_use) and a binary indicator of high alcohol consumption (high_use) allows for an analysis of the potential influence of alcohol on academic performance. This dataset can be analysed to understand and possibly predict student performance in relation to their personal and social habits.
# Numerically exploring the variables using summary for each
summary(joined_data$G3)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 10.00 12.00 11.52 14.00 18.00
summary(joined_data$freetime)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.000 3.000 3.224 4.000 5.000
summary(joined_data$goout)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 3.116 4.000 5.000
summary(joined_data$studytime)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 2.000 2.043 2.000 4.000
# Cross-tabulations of 'high_use' with 'freetime', 'goout', and 'studytime'
table(joined_data$high_use, joined_data$freetime)
##
## 1 2 3 4 5
## FALSE 15 46 112 65 21
## TRUE 2 14 40 40 15
table(joined_data$high_use, joined_data$goout)
##
## 1 2 3 4 5
## FALSE 19 82 97 40 21
## TRUE 3 15 23 38 32
table(joined_data$high_use, joined_data$studytime)
##
## 1 2 3 4
## FALSE 56 128 52 23
## TRUE 42 57 8 4
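Raw counts are hard to compare when the group sizes differ; column-wise proportions make the pattern clearer. A sketch reusing the counts from the 'studytime' cross-tabulation above:

```r
# Counts copied from the high_use x studytime table above
tab <- matrix(c(56, 42, 128, 57, 52, 8, 23, 4), nrow = 2,
              dimnames = list(high_use = c("FALSE", "TRUE"),
                              studytime = 1:4))

# Share of high users within each studytime level
round(prop.table(tab, margin = 2)["TRUE", ], 2)
#    1    2    3    4
# 0.43 0.31 0.13 0.15
```

The falling proportions across studytime levels already hint at the negative association confirmed by the model below.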
Next, for the graphical exploration, I'm going to create bar plots of the variables 'freetime', 'goout' and 'studytime' against 'alc_use'.
ggplot(joined_data, aes(x = freetime, y = alc_use)) +
geom_bar(stat = "summary", fun = "mean") +
labs(title = "Average Alcohol Consumption by Free Time", x = "Free Time", y = "Average Alcohol Use")
ggplot(joined_data, aes(x = goout, y = alc_use)) +
geom_bar(stat = "summary", fun = "mean") +
labs(title = "Average Alcohol Consumption by Going Out", x = "Going Out", y = "Average Alcohol Use")
ggplot(joined_data, aes(x = studytime, y = alc_use)) +
geom_bar(stat = "summary", fun = "mean") +
labs(title = "Average Alcohol Consumption by Study Time", x = "Study Time", y = "Average Alcohol Use")
To explore academic performance ('G3') against 'high_use', I'm going to create a box plot.
ggplot(joined_data, aes(x = as.factor(high_use), y = G3)) +
geom_boxplot() +
labs(title = "Final Grade (G3) by High Alcohol Consumption", x = "High Alcohol Use", y = "Final Grade (G3)")
Overall, the results of the exploration align with my previously stated hypotheses. There is an indication that higher alcohol consumption is related to lower academic performance, more free time, increased socialization with friends, and less study time.
# Fitting the logistic regression model using 'high_use' as the response variable
# and G3, freetime, goout, and studytime as predictors
model <- glm(high_use ~ G3 + freetime + goout + studytime,
data = joined_data, family = "binomial")
# Getting a summary of the fitted model
summary(model)
##
## Call:
## glm(formula = high_use ~ G3 + freetime + goout + studytime, family = "binomial",
## data = joined_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.95210 0.76876 -2.539 0.011108 *
## G3 -0.03499 0.03931 -0.890 0.373397
## freetime 0.10832 0.13728 0.789 0.430067
## goout 0.70101 0.12248 5.723 1.04e-08 ***
## studytime -0.58983 0.16798 -3.511 0.000446 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 386.02 on 365 degrees of freedom
## AIC: 396.02
##
## Number of Fisher Scoring iterations: 4
# Calculating and printing the odds ratios and confidence intervals for the coefficients
exp_coef <- exp(coef(model))
exp_confint <- exp(confint(model))
## Waiting for profiling to be done...
# Creating a data frame to nicely format the odds ratios and their confidence intervals
odds_ratio_df <- data.frame(
Variable = names(exp_coef),
OddsRatio = exp_coef,
Lower95CI = exp_confint[,1],
Upper95CI = exp_confint[,2]
)
# Printing the odds ratios and confidence intervals
print(odds_ratio_df)
## Variable OddsRatio Lower95CI Upper95CI
## (Intercept) (Intercept) 0.1419750 0.03046126 0.6255685
## G3 G3 0.9656148 0.89381353 1.0432400
## freetime freetime 1.1144073 0.85141137 1.4606593
## goout goout 2.0157819 1.59543764 2.5817002
## studytime studytime 0.5544194 0.39446586 0.7636077
# install.packages("broom")
library(broom)
# Tidying the model to include exponentiated coefficients (odds ratios) and confidence intervals
tidy_model <- tidy(model, exponentiate = TRUE, conf.int = TRUE)
# Printing the tidy model with odds ratios
print(tidy_model)
## # A tibble: 5 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.142 0.769 -2.54 0.0111 0.0305 0.626
## 2 G3 0.966 0.0393 -0.890 0.373 0.894 1.04
## 3 freetime 1.11 0.137 0.789 0.430 0.851 1.46
## 4 goout 2.02 0.122 5.72 0.0000000104 1.60 2.58
## 5 studytime 0.554 0.168 -3.51 0.000446 0.394 0.764
Now, to interpret…
In summary, the logistic regression analysis indicates that two of the variables, goout and studytime, have statistically significant associations with high alcohol consumption. Students who go out more often are more likely to have high alcohol consumption, while students who spend more time studying are less likely to. These results are in line with my hypotheses for these variables. However, academic performance (G3) and the amount of free time students have (freetime) do not show statistically significant effects on high alcohol consumption in this analysis.
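The bridge from the coefficient table to the odds ratios reported above is simply exponentiation of the log-odds estimates; for instance, using the 'goout' and 'studytime' estimates from the model summary:

```r
# Log-odds coefficients copied from the model summary above
exp(0.70101)   # ~2.016: each extra point of goout roughly doubles the odds of high use
exp(-0.58983)  # ~0.554: each extra point of studytime cuts the odds by almost half
```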
Now, I’m fitting my logistic regression model using only the predictors that were determined to be significant above.
# Fitting the logistic regression model with significant predictors
model <- glm(high_use ~ goout + studytime, data = joined_data, family = "binomial")
# Making predictions on the training data
joined_data$predicted_high_use <- if_else(predict(model, type = "response") > 0.5, TRUE, FALSE)
# Creating a 2x2 cross-tabulation of predictions vs actual values
confusion_matrix <- table(Actual = joined_data$high_use, Predicted = joined_data$predicted_high_use)
# Calculating the total proportion of inaccurately classified individuals (the training error)
training_error <- mean(joined_data$high_use != joined_data$predicted_high_use)
# Determining the most frequent class in the actual data
most_frequent_class <- names(which.max(table(joined_data$high_use)))
# Calculating the error rate for the guessing strategy
guessing_strategy_error <- mean(joined_data$high_use != most_frequent_class)
# Printing the confusion matrix, training error, and guessing strategy error
print(confusion_matrix)
## Predicted
## Actual FALSE TRUE
## FALSE 238 21
## TRUE 70 41
print(paste("Training error: ", training_error))
## [1] "Training error: 0.245945945945946"
print(paste("Guessing strategy error: ", guessing_strategy_error))
## [1] "Guessing strategy error: 0.3"
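Spelled out on its own, the guessing strategy always predicts the majority class and is wrong for every minority-class case; a sketch with labels matching the 259 FALSE / 111 TRUE split above:

```r
# Labels mirroring the actual class counts above (259 FALSE, 111 TRUE)
actual <- rep(c(FALSE, TRUE), times = c(259, 111))

# Always guessing the majority class (FALSE) errs on every TRUE case
baseline_error <- mean(actual != FALSE)
baseline_error  # 111/370 = 0.3
```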
# Creating a graphic visualization of actual values vs predictions
ggplot(joined_data, aes(x = as.factor(high_use), fill = as.factor(predicted_high_use))) +
geom_bar(position = "fill") +
labs(title = "Actual vs Predicted High Alcohol Use", x = "Actual High Use", y = "Proportion", fill = "Predicted") +
scale_y_continuous(labels = scales::percent_format())
# Loading the necessary library for cross-validation
library(boot)
# Defining the logistic regression model formula
model_formula <- high_use ~ goout + studytime
# Creating the glm model for cross-validation
glm_model <- glm(model_formula, data = joined_data, family = "binomial")
# Performing 10-fold cross-validation using the cv.glm function
set.seed(123) # for reproducibility
cv_results <- cv.glm(joined_data, glm_model, K = 10)
# Printing the cross-validation error estimates (delta: raw and bias-adjusted)
print(cv_results$delta)
## [1] 0.1750966 0.1749537
# Calculating the average prediction error
cv_error <- cv_results$delta[1]
# Printing the average prediction error
print(paste("10-fold CV average prediction error: ", cv_error))
## [1] "10-fold CV average prediction error: 0.175096615513868"
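One caveat: cv.glm's default cost is the average squared error of the predicted probabilities, not a misclassification rate. To cross-validate the classification error directly, a cost function with a 0.5 cutoff can be supplied (this follows the pattern in the boot package documentation; shown here on toy numbers rather than rerun on the data):

```r
# Misclassification cost at a 0.5 threshold, usable as
# cv.glm(data, model, cost = cost_fn, K = 10)
cost_fn <- function(actual, pred_prob) mean(abs(actual - pred_prob) > 0.5)

# Toy check: probabilities 0.4 and 0.6 land on the wrong side for labels 1 and 0
cost_fn(c(1, 0, 1, 0), c(0.9, 0.2, 0.4, 0.6))  # 0.5
```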
# Comparing to the model error from the Exercise Set
exercise_set_error <- 0.26
print(paste("Is the cross-validated model better than the Exercise Set model? ", cv_error < exercise_set_error))
## [1] "Is the cross-validated model better than the Exercise Set model? TRUE"
It seems that my model does have a smaller prediction error using 10-fold cross-validation compared to the model from the Exercise Set. Yay!
In this chapter, we explore the Boston Housing dataset, which comprises various attributes of housing areas around Boston, Massachusetts, as recorded in the 1970s. It's a rich dataset often used for understanding the housing market through statistical learning techniques. As per the assignment, I begin by loading the data and examining its structure, highlighting key variables like crime rates, property tax rates, and median home values. Next, I provide a visual and statistical summary, discussing each variable's distribution and interrelationships. Then I standardize the data to prepare for more complex analyses, including clustering and linear discriminant analysis (LDA), which reveal insights into the socio-economic patterns affecting housing values.
Through this assignment, I’ve learned new statistical learning techniques. I gained insights into housing market patterns by performing exploratory data analysis, standardization, clustering, and discriminant analysis, and enhanced my data visualization skills further.
date()
## [1] "Mon Dec 4 13:27:29 2023"
# Loading and Exploring the Boston Dataset
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
## Graphical Overview of the Data
Next, I'm going to use 'pairs' to create pair-wise scatter plots for an overview of relationships, and 'summary' to give a statistical summary of each variable.
pairs(Boston)
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
### Distributions of the Variables

- crim: The distribution is highly right-skewed, with most suburbs exhibiting low crime rates, but a few have extremely high crime rates.
- zn: This variable is also right-skewed, indicating that many suburbs have no land zoned for large residential plots.
- indus: Displays a more varied distribution with a noticeable peak around 18, suggesting a common proportion of industrial businesses across suburbs.
- chas: As a binary variable, the data shows that most suburbs do not border the river.
- nox: Exhibits a slight right-skewness, suggesting that while most areas have moderate nitric oxide levels, some areas have significantly high levels.
- rm: Appears to be normally distributed, with most suburbs having around the median number of 6.2 rooms.
- age: The distribution is left-skewed with many houses being older, peaking at 100 years.
- dis: The right-skewed nature indicates that most suburbs are close to employment centers, with a few being much further away.
- rad: The bimodal distribution suggests that suburbs are typically either very close or very far from highways.
- tax: Also bimodal and likely correlated with rad, indicating variations in tax rates depending on highway accessibility.
- ptratio: Shows slight left-skewness, with most suburbs having a higher ratio.
- black: This variable is left-skewed, with most areas having a high proportion, although some have significantly low proportions.
- lstat: Right-skewed; most suburbs have a lower proportion of lower-status population.
- medv: Right-skewed with a ceiling effect at 50, indicating capped median values in the dataset.

Notable pair-wise relationships visible in the scatter plots:

- rm and medv: A positive correlation suggests that suburbs with more rooms tend to have higher median home values.
- lstat and medv: A visible negative correlation implies that suburbs with a higher percentage of lower status have lower home values.
- nox and indus: A positive correlation indicates that more industrial areas have higher nitric oxide concentrations.
- dis and nox: A negative correlation suggests that areas further from employment centers have lower concentrations of nitric oxides.
- age and nox: There seems to be a trend where older houses are in areas with higher nitric oxide concentrations.
- rad and tax: A high correlation indicates that suburbs with better highway access tend to have higher tax rates.

# Installing and loading the caret package
if (!require(caret)) {
install.packages("caret")
library(caret)
}
## Loading required package: caret
## Loading required package: lattice
##
## Attaching package: 'lattice'
## The following object is masked from 'package:boot':
##
## melanoma
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
# Standardizing the dataset
scaled_Boston <- scale(Boston)
# Printing out summaries of the scaled data
summary(scaled_Boston)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
# Creating a categorical variable of the crime rate using quantiles
Boston$crime_cat <- cut(Boston$crim, breaks=quantile(Boston$crim, probs=seq(0, 1, by=0.25)), include.lowest=TRUE)
# Dropping the old crime rate variable from the dataset
Boston <- Boston[,-which(names(Boston) == "crim")]
# Dividing the dataset into train and test sets (80% train, 20% test)
trainIndex <- createDataPartition(Boston$crime_cat, p = .8, list = FALSE, times = 1)
train_set <- Boston[trainIndex, ]
test_set <- Boston[-trainIndex, ]
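As a quick sanity check (a minimal sketch repeating the steps above on a fresh copy of the data), the quartile-based cut should produce four bins of roughly equal size:

```r
library(MASS)
data("Boston")
# Recreate the quartile-based crime category as above
Boston$crime_cat <- cut(Boston$crim,
                        breaks = quantile(Boston$crim, probs = seq(0, 1, by = 0.25)),
                        include.lowest = TRUE)
# Four bins, each holding roughly 506/4 (about 126) suburbs
table(Boston$crime_cat)
```

Balanced classes like these also make the LDA prior probabilities close to 0.25 each, as seen in the model summary below.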
# Loading the MASS package for LDA
library(MASS)
# Fitting the LDA model using the categorical crime rate as the target variable
lda_fit <- lda(crime_cat ~ ., data = train_set)
# Summarizing the LDA fit
lda_fit
## Call:
## lda(crime_cat ~ ., data = train_set)
##
## Prior probabilities of groups:
## [0.00632,0.082] (0.082,0.257] (0.257,3.68] (3.68,89]
## 0.2512315 0.2487685 0.2487685 0.2512315
##
## Group means:
## zn indus chas nox rm age
## [0.00632,0.082] 34.867647 4.389902 0.03921569 0.4482049 6.618569 42.48235
## (0.082,0.257] 9.386139 9.156832 0.05940594 0.4887515 6.224386 57.50495
## (0.257,3.68] 2.415842 12.474653 0.12871287 0.5974059 6.357050 79.36832
## (3.68,89] 0.000000 18.114510 0.06862745 0.6821078 6.013686 91.23627
## dis rad tax ptratio black lstat
## [0.00632,0.082] 5.699057 3.490196 284.4510 17.52059 391.9486 6.928235
## (0.082,0.257] 4.621321 4.792079 322.8119 18.27822 385.2170 11.579208
## (0.257,3.68] 3.041023 6.009901 355.9208 17.78119 362.6541 12.779703
## (3.68,89] 2.011357 23.813725 663.4216 20.14608 294.5307 18.627157
## medv
## [0.00632,0.082] 27.76961
## (0.082,0.257] 22.87822
## (0.257,3.68] 24.26337
## (3.68,89] 16.03627
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.0058638802 0.0336333261 -0.038838035
## indus 0.0097365728 -0.0770542250 0.071930089
## chas -0.3058741802 -0.2575565730 0.151796895
## nox 3.2547006401 -4.9810251040 -11.481520459
## rm -0.1253415623 -0.1113155702 -0.184244380
## age 0.0099843038 -0.0084262780 -0.007954078
## dis -0.0470505232 -0.1668911643 0.156760508
## rad 0.3708158600 0.0936078668 0.028139720
## tax -0.0003962842 0.0010585594 0.001539570
## ptratio 0.0693013006 0.0365044515 -0.129862507
## black -0.0013254302 0.0004298002 0.001222392
## lstat 0.0345479590 -0.0358900241 0.046297776
## medv 0.0251083797 -0.0384011550 -0.024467292
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9483 0.0399 0.0119
# Plotting the LDA model
plot(lda_fit)
# Saving the actual crime categories from the test set
actual_crime_categories <- test_set$crime_cat
# Removing the categorical crime variable from the test set
test_set <- test_set[,-which(names(test_set) == "crime_cat")]
# Using the LDA model to predict crime categories on the test set
predicted_crime_categories <- predict(lda_fit, newdata=test_set)$class
# Cross-tabulating the predicted vs actual crime categories
confusion_matrix <- table(Predicted = predicted_crime_categories, Actual = actual_crime_categories)
# Printing the confusion matrix
confusion_matrix
## Actual
## Predicted [0.00632,0.082] (0.082,0.257] (0.257,3.68] (3.68,89]
## [0.00632,0.082] 13 2 0 0
## (0.082,0.257] 8 18 8 0
## (0.257,3.68] 4 5 16 0
## (3.68,89] 0 0 1 25
Based on the confusion matrix above, we can evaluate the performance of the classifier. The model correctly classifies 72 of the 100 test observations (72% accuracy). Performance is best for the highest crime-rate category, where all 25 suburbs are classified correctly, and weakest for the lowest category (13 of 25 correct), with most of the confusion occurring between the adjacent middle categories.
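These figures can be computed directly from the matrix. A small sketch, re-entering the printed counts so it stands alone (with a different random split the counts would differ):

```r
# Rows = predicted, columns = actual, as printed above
confusion_matrix <- matrix(c(13,  2,  0,  0,
                              8, 18,  8,  0,
                              4,  5, 16,  0,
                              0,  0,  1, 25),
                           nrow = 4, byrow = TRUE)
# Overall accuracy: correct predictions (diagonal) over all test cases
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)  # 72/100 = 0.72
# Per-class accuracy: share of each true category predicted correctly
class_accuracy <- diag(confusion_matrix) / colSums(confusion_matrix)
round(class_accuracy, 2)  # 0.52, 0.72, 0.64, 1.00
```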
Next, I’m going to perform k-means clustering on the standardized Boston dataset to identify clusters within the data.
# Reload the Boston dataset
data("Boston")
# Standardizing the dataset
scaled_Boston <- scale(Boston)
# Calculating the distances between observations
distances <- dist(scaled_Boston)
# Installing and loading the factoextra package
if (!require(factoextra)) {
install.packages("factoextra")
library(factoextra)
}
## Loading required package: factoextra
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
# Determining the optimal number of clusters using the elbow method
set.seed(123)
fviz_nbclust(scaled_Boston, kmeans, method = "wss") +
geom_vline(xintercept = 4, linetype = 2) +
labs(subtitle = "Elbow method")
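The same elbow curve can also be computed by hand, which makes clear what `fviz_nbclust` is plotting: the total within-cluster sum of squares (WSS) for each candidate number of clusters. A sketch:

```r
library(MASS)
data("Boston")
scaled_Boston <- scale(Boston)
set.seed(123)
# Total within-cluster sum of squares for k = 1..10;
# nstart guards against poor random initialisations
wss <- sapply(1:10, function(k)
  kmeans(scaled_Boston, centers = k, nstart = 10)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster SS")
```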
# Running k-means with the determined optimal number of clusters
set.seed(123)
kmeans_result <- kmeans(scaled_Boston, centers = 4)
# Creating a data frame for plotting
plot_data <- as.data.frame(scaled_Boston)
plot_data$cluster <- factor(kmeans_result$cluster)
# Visualizing the clusters using two variables from the dataset
ggplot(plot_data, aes(x = rm, y = lstat)) +
geom_point(aes(color = cluster)) +
labs(color = 'Cluster')
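Beyond the scatter plot, the clustering can be summarised numerically: cluster sizes, the share of variance the clustering explains, and the standardized centers, where values far from 0 mark the variables that characterise a cluster. A sketch repeating the steps above:

```r
library(MASS)
data("Boston")
set.seed(123)
kmeans_result <- kmeans(scale(Boston), centers = 4)
# How many suburbs fall in each cluster
kmeans_result$size
# Proportion of total variance explained by the 4-cluster solution
kmeans_result$betweenss / kmeans_result$totss
# Standardized cluster centers for a few key variables
round(kmeans_result$centers[, c("crim", "nox", "rm", "lstat", "medv")], 2)
```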
library(MASS)
data("Boston")
# Standardizing the dataset
scaled_Boston <- scale(Boston)
Next, performing the K-means clustering
set.seed(123) # for reproducibility
# According to the assignment, a reasonable number of clusters is >2, here I'm choosing 4
kmeans_result <- kmeans(scaled_Boston, centers = 4)
Now, performing LDA using clusters as target classes
# Adding the cluster assignments as a factor to the Boston data
Boston$cluster <- factor(kmeans_result$cluster)
# Fitting LDA model using the clusters as target classes
library(MASS) # for LDA
lda_fit <- lda(cluster ~ ., data = Boston)
Finally, visualizing the results with Biplot
# Biplot for LDA with arrows for original variables
plot(lda_fit)
# Examining the model's coefficients
lda_fit$scaling
## LD1 LD2 LD3
## crim -0.033252529 0.093706888 -0.003875224
## zn -0.003852921 0.005772233 -0.047264096
## indus -0.107232590 -0.066745602 -0.041117990
## chas 0.068548580 -0.197356300 0.410601305
## nox -7.766232173 -5.409634182 -7.673642525
## rm 0.139330592 0.341290080 -0.794324360
## age 0.003903578 -0.004331691 0.015529903
## dis -0.001695810 -0.034881757 -0.176213368
## rad -0.063928129 -0.015823030 -0.007533964
## tax -0.004651926 -0.002905580 -0.003796952
## ptratio -0.159544007 -0.041917959 -0.038208490
## black 0.008358645 -0.020444688 -0.004007031
## lstat -0.028069814 0.013570317 -0.056673688
## medv 0.005932882 0.013682649 -0.086102624
The LDA scaling coefficients reveal that nox (nitric oxides concentration) is the predominant variable influencing the separation of clusters on the first discriminant (LD1), indicating its strong role in differentiating the clusters. rm (average number of rooms) emerges as the most significant for the second discriminant (LD2), suggesting its importance in further distinguishing between clusters. On the third discriminant (LD3), chas (proximity to Charles River) has the highest coefficient, highlighting its influence in cluster separation at this level. These variables—nox, rm, and chas—are therefore the most critical linear separators for the clusters, with their varying scales and units contributing to their discriminant weights.
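One way to back this reading with numbers is to rank the predictors by the absolute size of their coefficients, keeping in mind the caveat above that raw coefficients reflect each variable's units (which is largely why `nox`, measured on a small scale, gets such a large weight). A sketch rebuilding the model:

```r
library(MASS)
data("Boston")
set.seed(123)
km <- kmeans(scale(Boston), centers = 4)
Boston$cluster <- factor(km$cluster)
lda_fit <- lda(cluster ~ ., data = Boston)
# Predictors ranked by absolute LD1 coefficient
head(sort(abs(lda_fit$scaling[, "LD1"]), decreasing = TRUE), 5)
```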
Now, the goal is to project the data using the LDA model’s scaling matrix and then visualize the projections in 3D.
library(dplyr)
# Using the full Boston data (minus the cluster column) as predictors; 'lda_fit' is my LDA model from before
model_predictors <- Boston[, -which(names(Boston) == "cluster")]
# Checking the dimensions
dim(model_predictors)
## [1] 506 14
dim(lda_fit$scaling)
## [1] 14 3
# Matrix multiplication to get the projections
matrix_product <- as.matrix(model_predictors) %*% lda_fit$scaling
matrix_product <- as.data.frame(matrix_product)
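Worth noting: `predict.lda()` centers the predictors before applying the scaling matrix, so this manual product differs from `predict(lda_fit)$x` only by a constant shift on each discriminant axis. A self-contained check (rebuilding the objects above):

```r
library(MASS)
data("Boston")
set.seed(123)
km <- kmeans(scale(Boston), centers = 4)
Boston$cluster <- factor(km$cluster)
lda_fit <- lda(cluster ~ ., data = Boston)
manual <- as.matrix(Boston[, names(Boston) != "cluster"]) %*% lda_fit$scaling
auto <- predict(lda_fit)$x
# If the two differ only by a per-axis constant, each column of the
# difference has (numerically) zero spread
apply(manual - auto, 2, function(d) diff(range(d)))
```

The shift does not matter for visualization, since it only moves the whole point cloud without changing its shape.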
Now, onto the 3D visualization…
# Installing and loading the plotly package
if (!require(plotly)) {
install.packages("plotly")
library(plotly)
}
## Loading required package: plotly
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
# Creating a 3D scatter plot using the LDA projections
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type = 'scatter3d', mode = 'markers', color = Boston$cluster) %>%
layout(scene = list(xaxis = list(title = 'LD1'),
yaxis = list(title = 'LD2'),
zaxis = list(title = 'LD3')))
Interpretation:
In this chapter, we are working with the Human Development dataset, examining a variety of indicators that reflect the status of education, labor force participation, economic prowess, and gender equality across countries. The new techniques introduced here include standardization, a crucial step to neutralize scale discrepancies and ensure an unbiased analysis, as well as Principal Component Analysis (PCA) to distill the essence of the data. PCA is applied both before and after standardization to understand the most influential factors shaping human development patterns.
This assignment not only further improved my statistical understanding but also gave me valuable insights into the socio-economic web that binds countries together. Through this assignment, I’ve become more comfortable with the intricacies of dimensionality reduction, used to make abstract concepts tangible.
date()
## [1] "Mon Dec 4 13:27:31 2023"
library(tidyverse)
# Loading the 'human' dataset
human <- read_csv("/Users/annatolonen/Desktop/IODS-project/data/human.csv")
## Rows: 155 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Country
## dbl (8): Female_Secondary_Education, Female_Labor_Force_Participation, Expec...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Converting 'human' from a tibble to a standard data frame
human <- as.data.frame(human)
# Moving country names to rownames
rownames(human) <- human$Country
human <- human[,-which(names(human) == "Country")]
# Statistical summaries for each variable
summary(human)
## Female_Secondary_Education Female_Labor_Force_Participation
## Min. : 0.90 Min. :13.50
## 1st Qu.: 27.15 1st Qu.:44.45
## Median : 56.60 Median :53.50
## Mean : 55.37 Mean :52.52
## 3rd Qu.: 85.15 3rd Qu.:61.90
## Max. :100.00 Max. :88.10
## Expected_Years_of_Education Life_Expectancy GNI_per_Capita
## Min. : 5.40 Min. :49.00 Min. : 581
## 1st Qu.:11.25 1st Qu.:66.30 1st Qu.: 4198
## Median :13.50 Median :74.20 Median : 12040
## Mean :13.18 Mean :71.65 Mean : 17628
## 3rd Qu.:15.20 3rd Qu.:77.25 3rd Qu.: 24512
## Max. :20.20 Max. :83.50 Max. :123124
## Maternal_Mortality_Ratio Adolescent_Birth_Rate
## Min. : 1.0 Min. : 0.60
## 1st Qu.: 11.5 1st Qu.: 12.65
## Median : 49.0 Median : 33.60
## Mean : 149.1 Mean : 47.16
## 3rd Qu.: 190.0 3rd Qu.: 71.95
## Max. :1100.0 Max. :204.80
## Female_Representation_in_Parliament
## Min. : 0.00
## 1st Qu.:12.40
## Median :19.30
## Mean :20.91
## 3rd Qu.:27.95
## Max. :57.50
# Checking for missing values
if(any(is.na(human))) {
stop("Missing values found in the dataset. Please address these before proceeding.")
}
For the graphical overview, I’m first going to examine the distribution of each variable in the dataset using faceted histograms. This approach allows us to see the distribution of each variable side by side for easy comparison.
library(ggplot2)
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
# Creating a temporary data frame with country names as a column
human_temp <- human
human_temp$Country <- rownames(human)
# Melting the temporary data for faceting
human_long <- melt(human_temp, id.vars = "Country")
# Creating faceted histograms
ggplot(human_long, aes(x = value)) +
geom_histogram(bins = 30, fill = "blue", color = "black") +
facet_wrap(~ variable, scales = "free") +
theme_minimal() +
labs(x = "Value", y = "Frequency", title = "Distribution of Variables in the Human Dataset")
Interpretation:
Overall, the skewness in many of the variables indicates that there are disparities among countries, with a significant number of countries having lower indicators of human development and gender equality. The bimodal distribution in female secondary education suggests a gap between countries with very high and very low educational attainment for females.
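These skewness claims can be quantified without extra packages using the standard moment-based formula (a sketch; the guarded call at the end assumes the `human` data frame loaded above):

```r
# Sample skewness: positive = right-skewed, negative = left-skewed
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3
# Column-wise skewness of the human data (guarded so the sketch runs alone)
if (exists("human")) print(round(sapply(human, skewness), 2))
```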
Now, to explore the relationships between the variables, I’m going to use a scatter plot matrix (SPLOM)…
# Creating an enhanced scatter plot matrix using GGally
library(GGally)
# Since I've already set country names as row names, I'll drop the row names because ggpairs expects a data frame without row names.
human_no_rownames <- human
rownames(human_no_rownames) <- NULL
# Using ggpairs to create the scatter plot matrix (I'm using ggpairs over basic plot functions to efficiently visualize the pairwise relationships and distribution of each variable, facilitating a comprehensive initial assessment.)
ggpairs(human_no_rownames,
lower = list(continuous = wrap("points", size = 0.5, alpha = 0.5)),
upper = list(continuous = wrap("cor", size = 4)),
diag = list(continuous = wrap("barDiag")),
axis.labels = 'show',
axis.label.size = 2,
legend.size = 5
) +
theme_bw() +
theme(
text = element_text(size = 4),
axis.text.x = element_text(angle = 45, hjust = 1) # Rotating the x axis labels for better visibility
)
Observations:
Now, I will, as per the assignment instructions, perform Principal Component Analysis (PCA) on the raw (non-standardized) human data. The goal is to show the variability captured by the principal components and to visualize the data using a biplot that displays the observations by the first two principal components.
library(ggfortify)
library(ggrepel)
# Performing PCA on Raw Data
pca_raw <- prcomp(human, scale. = FALSE)
# Extracting PCA loadings
loadings_matrix <- as.data.frame(pca_raw$rotation)
# Getting the proportion of variance explained by the PCs
explained_variance <- round(pca_raw$sdev^2 / sum(pca_raw$sdev^2) * 100, 2)
# Biplot for Raw Data PCA without automatic labeling
biplot_pca <- autoplot(pca_raw, data = human, label = FALSE, alpha = 0.5)
# Adding arrows for loadings
biplot_with_arrows <- biplot_pca +
geom_segment(data = loadings_matrix,
aes(x = 0, y = 0, xend = PC1, yend = PC2),
arrow = arrow(type = "closed", length = unit(0.1, "inches")),
color = "blue", linewidth = 1)
# Using ggrepel for label placement without duplication and with smaller font size
biplot_with_arrows <- biplot_with_arrows +
geom_text_repel(data = loadings_matrix,
aes(x = PC1, y = PC2, label = rownames(loadings_matrix)),
size = 3,
box.padding = unit(0.2, "lines"),
point.padding = unit(0.3, "lines"),
segment.size = 0.2)
# Axis titles with PC values and variance explained
biplot_with_arrows <- biplot_with_arrows +
labs(x = paste("PC1 -", explained_variance[1], "% Variance"),
y = paste("PC2 -", explained_variance[2], "% Variance"))
# Printing the plot
biplot_with_arrows
Interpretation:
The first principal component (PC1) accounts for an overwhelming majority of the variance (99.99%), indicating that a single dimension captures almost all the information in the data. This is a strong indication that one or a few variables with large numeric scales are dominating the PCA, which is a common occurrence when the variables are not standardized.
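This scale effect is easy to reproduce on toy data: take two unrelated variables, one on a scale thousands of times larger (the role `GNI_per_Capita` plays here), and PC1 absorbs essentially all the variance. A sketch with made-up numbers:

```r
set.seed(1)
# Two independent variables on very different scales, loosely mimicking
# GNI per capita (tens of thousands) versus a 0-100 index
toy <- data.frame(income = rnorm(100, mean = 17000, sd = 18000),
                  index  = rnorm(100, mean = 50,    sd = 25))
pca_toy <- prcomp(toy, scale. = FALSE)
# PC1 captures nearly all the variance, purely because of scale
summary(pca_toy)$importance["Proportion of Variance", ]
```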
In this biplot, the arrows represent the original variables, with their direction and length indicating how each variable contributes to the principal components. The arrows pointing towards the right, along the PC1 axis, correspond to ‘GNI per Capita’, ‘Life Expectancy’, ‘Female Secondary Education’, and ‘Female Labor Force Participation’, suggesting these variables are positively correlated with each other and significantly contribute to the variance in PC1. This could imply that countries with higher GNI per capita tend to also have higher life expectancy, greater female secondary education, and higher female labor force participation.
Conversely, the arrow for ‘Maternal Mortality Ratio’ is directed almost entirely down the PC2 axis, indicating that this variable is somewhat independent of the others and contributes uniquely to the second principal component. This suggests that the maternal mortality ratio varies in a way that is not captured by the variation in PC1.
The cluster of points (countries) near the center of the biplot suggests that many countries have similar scores on these components, indicating comparable levels of the socio-economic indicators measured. Countries further away from the center along PC1 or PC2 represent outliers with significantly different profiles from the average.
Now, I will standardize the variables of the human dataset to give each variable equal weight in our PCA. After standardization, I will perform PCA again to identify the principal components that best capture the variability in our data.
library(factoextra)
# PCA on Standardized Data
human_standardized <- scale(human)
pca_standardized <- prcomp(human_standardized, scale. = TRUE) # scale. is redundant here since the data are already standardized, but harmless
# Summary of PCA - Standardized Data
explained_variance <- round(pca_standardized$sdev^2 / sum(pca_standardized$sdev^2) * 100, 2)
# Biplot for Standardized Data PCA
fviz_pca_biplot(pca_standardized, label = "var", repel = TRUE,
ggtheme = theme_minimal(), labelsize = 3,
pointsize = 2, alpha = 0.5, col.var = "blue",
xlab = paste("PC1 -", explained_variance[1], "% of Variance"),
ylab = paste("PC2 -", explained_variance[2], "% of Variance"))
The biplot of PCA on standardized data captures the interplay between development indicators such as education and health (PC1) and gender equality and reproductive health (PC2), reflecting the multifaceted nature of human development across different countries.
Interpretation and comparison to raw data PCA: The principal component analysis (PCA) on the non-standardized data portrayed an atypical scenario where the first principal component (PC1) dominated the variance (99.99%). This suggested that the scale of certain variables, likely those with larger numerical ranges like ‘GNI per Capita’, heavily influenced the analysis. Such a result often obscures the true relationships between variables and can misrepresent the underlying structure of the data.
After standardizing the variables, the PCA results are notably different. The first two principal components now explain a combined total of approximately 72.63% of the variance, with PC1 accounting for 57.03% and PC2 for 15.60%. This indicates a more equitable distribution of variance across the components, highlighting the multidimensional nature of human development. The standardization process gives each variable equal weight by adjusting for scale differences, thereby allowing for a more accurate reflection of the data’s structure.
In the standardized biplot, ‘Female Secondary Education’, ‘Life Expectancy’, and ‘Expected Years of Education’ are prominent along PC1, suggesting that this component might represent a general development factor, encompassing health and education. In contrast, ‘Maternal Mortality Ratio’ and ‘Adolescent Birth Rate’ are depicted with significant negative loadings on PC2, contrasting reproductive health challenges against the positive social advancements indicated by ‘Female Labor Force Participation’ and ‘Female Representation in Parliament’. This biplot emphasizes the intersection between socio-economic status, health, and gender equality, offering insights into how these aspects co-vary across nations.
The difference in the results with and without standardization underscores why it is crucial to standardize variables in PCA when they are on different scales. Standardization mitigates the risk of misinterpretation that can arise from the disproportionate influence of higher-magnitude variables. Consequently, the analysis with standardized data is more reliable for understanding the true relationships between the variables and for making inferences about the underlying phenomena they represent.
The biplot from the PCA on standardized human data depicts a multidimensional snapshot of socio-economic and health indicators across countries. The first principal component (PC1), which explains 57.03% of the variance, seems to capture an overarching dimension of human development. The positive loadings on PC1 for ‘Female Secondary Education’, ‘Life Expectancy’, and ‘Expected Years of Education’ suggest that this principal component might represent the general well-being and development status of a country. Countries with higher scores on PC1 are likely those with better educational outcomes, longer life expectancy, and overall higher development indices.
The second principal component (PC2) explains 15.60% of the variance and appears to capture aspects related to reproductive health and gender equality. The positive direction of ‘Female Labor Force Participation’ and ‘Female Representation in Parliament’ on PC2 could be interpreted as a dimension of gender empowerment, reflecting a country’s progress in integrating women into the workforce and political spheres. Conversely, the negative direction of ‘Maternal Mortality Ratio’ and ‘Adolescent Birth Rate’ suggests that this component also contrasts countries with reproductive health challenges against those with more favorable conditions for women’s health and empowerment.
These two dimensions, represented by PC1 and PC2, seem to encapsulate the interplay between educational and health status, economic conditions, and gender equality measures, providing a nuanced landscape of human development. Countries situated further along PC1 and PC2 in the positive direction might be those with better performance in these indicators, while those further in the negative direction might face challenges in these areas. This interpretation aligns with the broader understanding that development is a multifaceted phenomenon, influenced by a complex mix of economic, social, and health factors.
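To tie these readings back to the numbers, the strongest loadings on each component can be tabulated with a small helper (a sketch; the guarded calls at the end assume `pca_standardized` from the chunk above):

```r
# Largest-magnitude loadings on a given principal component
top_loadings <- function(pca, pc = 1, n = 4) {
  l <- pca$rotation[, pc]
  head(l[order(abs(l), decreasing = TRUE)], n)
}
if (exists("pca_standardized")) {
  print(top_loadings(pca_standardized, pc = 1))
  print(top_loadings(pca_standardized, pc = 2))
}
```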
# Loading the tea dataset
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)
# Converting 'age' into a categorical variable (binned age groups)
tea$age_group <- cut(tea$age,
breaks = c(15, 24, 34, 44, 54, 64, Inf),
labels = c("15-24", "25-34", "35-44", "45-54", "55-64", "65+"),
right = FALSE)
tea$age_group <- factor(tea$age_group)
# Removing the original 'age' variable now that we have 'age_group'
tea_mca <- tea[, !names(tea) %in% "age"]
# Loading necessary libraries
library(FactoMineR)
library(factoextra)
library(dplyr)
library(ggplot2)
# Performing MCA on the preprocessed dataset
mca_results <- MCA(tea_mca, graph = FALSE)
# Creating a data frame with variable names and total contributions
var_contributions <- as.data.frame(mca_results$var$contrib)
var_contributions$variable <- rownames(var_contributions)
var_contributions$total_contrib <- rowSums(var_contributions[ , 1:5])
# Identifying the top contributing variables
top_contrib_vars <- var_contributions %>%
arrange(desc(total_contrib)) %>%
head(20) %>%
.$variable
# Visualizing the variable biplot, restricted to the top contributing variables identified above
fviz_mca_var(mca_results, choice = "var",
select.var = list(name = top_contrib_vars),
repel = TRUE,
ggtheme = theme_minimal(),
labelsize = 3,
pointsize = 2,
alpha = 0.5)
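Before reading the biplot, it is worth checking how much inertia the first dimensions actually carry, since MCA axes typically explain far less individually than PCA axes do. A sketch (guarded, assuming `mca_results` from the chunk above):

```r
library(factoextra)
if (exists("mca_results")) {
  # Eigenvalues: percentage of total inertia captured per dimension
  print(head(mca_results$eig, 5))
  # The same information as a scree plot
  fviz_screeplot(mca_results, addlabels = TRUE)
}
```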
Interpretation:
Comments on the Biplot Output: